{"nbformat":4,"nbformat_minor":0,"metadata":{"anaconda-cloud":{},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.5.2"},"colab":{"name":"Tutorial VI_tf2.ipynb","provenance":[],"collapsed_sections":["SGyY2JPXtsq7","SH0SSbAftsq1","w7FnR5Kwtsq9","TvQjWvbWtsrH","HHSCevnwtsrQ","I3ojvV71pHJ6","4aqxQ_X-tsrV","7BldvImFYfm4","PCKy-0EotsrX","y7-p8ClctsrY"],"toc_visible":true},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"Lbpit0_rtspv","colab_type":"text"},"source":["# Tutorial VI: Recurrent Neural Networks"]},{"cell_type":"markdown","metadata":{"id":"C-Fr8e3Ltspx","colab_type":"text"},"source":[
"Bern Winter School on Machine Learning, 27-31 January 2020\n","Prepared by Mykhailo Vladymyrov.\n","\n",
"This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License."]},{"cell_type":"markdown","metadata":{"id":"S0Upuaj8tspy","colab_type":"text"},"source":["In this session we will see what an RNN is. We will use it to predict/generate a text sequence, but the same approach can be applied to any sequential data.\n"]},{"cell_type":"markdown","metadata":{"id":"7QUO3V3Stspz","colab_type":"text"},"source":["So far we have looked at data that is available all at once. In many cases, however, the data is sequential (weather, speech, sensor signals, etc.).\n","RNNs are specifically designed for such tasks."
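,"\n","At each step a recurrent network keeps a hidden state $h_t$ that summarizes the sequence seen so far, and updates it from the current input $x_t$. For a vanilla RNN cell the update is\n","\n","$$h_t = \\tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \\qquad y_t = W_{hy} h_t + b_y$$\n","\n","The LSTM cells used below add gating to this basic recurrence, which helps preserve information over long ranges."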
]},{"cell_type":"markdown","metadata":{"id":"SGyY2JPXtsq7","colab_type":"text"},"source":["## 1. Load necessary libraries"]},{"cell_type":"code","metadata":{"id":"9jKCF9MpKWdF","colab_type":"code","colab":{}},"source":["# if using Google Colab\n","%tensorflow_version 2.x"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"DpRETZFNtsq7","colab_type":"code","colab":{}},"source":["import sys\n","\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import IPython.display as ipyd\n","import tensorflow as tf\n","import collections\n","import time\n","\n","# We'll tell matplotlib to inline any drawn figures like so:\n","%matplotlib inline\n","plt.style.use('ggplot')\n","\n","# let the GPU memory grow as needed (skipped when no GPU is available)\n","physical_devices = tf.config.experimental.list_physical_devices('GPU')\n","if physical_devices:\n","    tf.config.experimental.set_memory_growth(physical_devices[0], True)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SH0SSbAftsq1","colab_type":"text"},"source":["## Unpack the materials\n","If using Colab, run the next cell."]},{"cell_type":"code","metadata":{"id":"Grv04xmitsq2","colab_type":"code","colab":{}},"source":["p = tf.keras.utils.get_file('./material.tgz', 'https://scits-training.unibe.ch/data/tut_files/material.tgz')\n","!mv {p} .\n","!tar -xvzf material.tgz > /dev/null 2>&1"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"rLWt72gnKj4M","colab_type":"code","colab":{}},"source":["from utils import gr_disp"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"w7FnR5Kwtsq9","colab_type":"text"},"source":["## 2. Load the text data"]},{"cell_type":"code","metadata":{"id":"N3cmvKeatsq-","colab_type":"code","colab":{}},"source":["def read_data(fname):\n","    # read the file and split it into a flat array of words\n","    with open(fname) as f:\n","        content = f.readlines()\n","    content = [x.strip() for x in content]\n","    content = [word for line in content for word in line.split()]\n","    content = np.array(content)\n","    return content"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"m4KJQxKqtsrA","colab_type":"code","colab":{}},"source":["training_file = 'RNN/rnn.txt'"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"mMumjsH8tsrD","colab_type":"code","colab":{}},"source":["training_data = read_data(training_file)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"Q_GMc64ptsrF","colab_type":"code","colab":{}},"source":["print(training_data[:100])"],"execution_count":0,"outputs":[]},
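{"cell_type":"markdown","metadata":{"id":"corpus-stats-md","colab_type":"text"},"source":["Let's take a quick look at the corpus size and the number of distinct words it contains (a small sanity check on the loaded data):"]},{"cell_type":"code","metadata":{"id":"corpus-stats-code","colab_type":"code","colab":{}},"source":["# quick check: corpus size and number of distinct words\n","print('total words: ', len(training_data))\n","print('unique words:', len(set(training_data)))"],"execution_count":0,"outputs":[]},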
{"cell_type":"markdown","metadata":{"id":"TvQjWvbWtsrH","colab_type":"text"},"source":["## 3. Build dataset\n","We will assign an id to each word, and build dictionaries word->id and id->word.\n","The most frequent words get the lowest ids."]},{"cell_type":"code","metadata":{"id":"CqWfeze4tsrI","colab_type":"code","colab":{}},"source":["def build_dataset(words):\n","    # count word occurrences; most_common() returns words sorted by frequency\n","    count = collections.Counter(words).most_common()\n","    dictionary = {}\n","    for word, _ in count:\n","        dictionary[word] = len(dictionary)\n","    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))\n","    return dictionary, reverse_dictionary"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"UxjTb5VUtsrK","colab_type":"code","colab":{}},"source":["dictionary, reverse_dictionary = build_dataset(training_data)\n","vocab_size = len(dictionary)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"HJt3d4lJtsrL","colab_type":"code","colab":{}},"source":["print(dictionary)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"e1jtdvAztsrN","colab_type":"text"},"source":["The whole text then becomes a sequence of word ids:"]},{"cell_type":"code","metadata":{"id":"sNVe-0P_tsrO","colab_type":"code","colab":{}},"source":["words_as_int = [dictionary[w] for w in training_data]\n","print(words_as_int)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"HHSCevnwtsrQ","colab_type":"text"},"source":["## 4. Build model"]},{"cell_type":"markdown","metadata":{"id":"phW3ru-dUSyv","colab_type":"text"},"source":["We will build the model in TF2.\n","It will contain an embedding layer, three LSTM layers, and a dense layer on top that outputs the probability of the next word.\n","All LSTM layers return full sequences, so the model predicts the next word at every position of the input:"]},{"cell_type":"code","metadata":{"id":"S3Y8GDDqjvyx","colab_type":"code","colab":{}},"source":["# Parameters\n","n_input = 3  # length of the word sequence used to predict the following word\n","\n","# numbers of units in the LSTM layers\n","n_hidden = [256, 512, 128]\n","\n","model = tf.keras.Sequential()\n","model.add(tf.keras.layers.Embedding(vocab_size, 128, input_length=n_input))\n","\n","for n_h in n_hidden:\n","    model.add(tf.keras.layers.LSTM(n_h, return_sequences=True, name='lstm%d' % n_h))\n","\n","model.add(tf.keras.layers.Dense(vocab_size, activation='softmax'))\n","\n","model.compile(optimizer='rmsprop',\n","              loss='sparse_categorical_crossentropy',\n","              metrics=['accuracy'])\n","\n","W0 = model.get_weights()  # save the initial weights, so the model can be reset later\n","model.summary()"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"I3ojvV71pHJ6","colab_type":"text"},"source":[
"## 5. Data streaming"]},{"cell_type":"markdown","metadata":{"id":"z_EYsL_TVaBz","colab_type":"text"},"source":["Here we will see how to build a `tf.data` pipeline to feed the model during training:"]},{"cell_type":"code","metadata":{"id":"muEoZW05nFQt","colab_type":"code","colab":{}},"source":["# create a tf.data.Dataset object\n","word_dataset = tf.data.Dataset.from_tensor_slices(words_as_int)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"kfSyUC6CnlZ8","colab_type":"code","colab":{}},"source":["# the take method generates elements:\n","for i in word_dataset.take(5):\n","    print(reverse_dictionary[i.numpy()])"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"SBK7FWPFVwSP","colab_type":"text"},"source":["The `batch` method creates a dataset that generates sequences of elements:"]},{"cell_type":"code","metadata":{"id":"jtGgHXkPnc7P","colab_type":"code","colab":{}},"source":["sequences = word_dataset.batch(n_input + 1, drop_remainder=True)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"bHOcJ_Wgn3n_","colab_type":"code","colab":{}},"source":["# helper for int-to-text conversion\n","to_text = lambda arr: ' '.join(reverse_dictionary[it] for it in arr)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"v0DYh6wNnuJr","colab_type":"code","colab":{}},"source":["for item in sequences.take(5):\n","    print(to_text(item.numpy()))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Pe6o55hyWnyX","colab_type":"text"},"source":["The `map` method lets us apply any function to preprocess the data:"]},{"cell_type":"code","metadata":{"id":"7mFJ4jNloJk7","colab_type":"code","colab":{}},"source":["def split_input_target(chunk):\n","    # for a chunk of n_input+1 words, the input is the first n_input words\n","    # and the target is the same sequence shifted by one word\n","    input_text = chunk[:-1]\n","    target_text = chunk[1:]\n","\n","    return input_text, target_text\n","\n","dataset = sequences.map(split_input_target)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"bbhGQ5k_XE5f","colab_type":"text"},"source":["The model will predict `input_text` -> `target_text`:"]},{"cell_type":"code","metadata":{"id":"6erRolZ-omdr","colab_type":"code","colab":{}},"source":["for input_example, target_example in dataset.take(1):\n","    print('Input data: ', to_text(input_example.numpy()))\n","    print('Target data:', to_text(target_example.numpy()))"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"74eDk9K_XXQ0","colab_type":"text"},"source":["Finally, we shuffle the items and produce minibatches of 16 elements:"]},{"cell_type":"code","metadata":{"id":"LcDexYkspBa9","colab_type":"code","colab":{}},"source":["dataset = dataset.shuffle(10000).batch(16, drop_remainder=True)\n","dataset"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"GX7Fdz3PXzxY","colab_type":"text"},"source":["Let's test the untrained model:"]},{"cell_type":"code","metadata":{"id":"QNa3Ysj3qOEh","colab_type":"code","colab":{}},"source":["for input_example_batch, target_example_batch in dataset.take(1):\n","    example_batch_predictions = model(input_example_batch)\n","    print(example_batch_predictions.shape, \"# (batch_size, sequence_length, vocab_size)\")"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"aV13OaTQqcVU","colab_type":"code","colab":{}},"source":["print('input: ', to_text(input_example_batch.numpy()[0]))\n","print('target:', to_text(target_example_batch.numpy()[0]))\n","print('pred:  ', to_text(example_batch_predictions.numpy()[0].argmax(axis=1)))\n"],"execution_count":0,"outputs":[]},
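{"cell_type":"markdown","metadata":{"id":"untrained-loss-md","colab_type":"text"},"source":["The untrained model outputs nearly uniform probabilities over the vocabulary, so its cross-entropy loss should be close to `ln(vocab_size)`. A quick sanity check (a minimal sketch, reusing the example batch from above):"]},{"cell_type":"code","metadata":{"id":"untrained-loss-code","colab_type":"code","colab":{}},"source":["# sanity check: loss of the untrained model vs. the uniform-prediction baseline\n","example_batch_loss = tf.keras.losses.sparse_categorical_crossentropy(\n","    target_example_batch, example_batch_predictions)\n","print('mean loss:     ', example_batch_loss.numpy().mean())\n","print('ln(vocab_size):', np.log(vocab_size))"],"execution_count":0,"outputs":[]},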
{"cell_type":"markdown","metadata":{"id":"4aqxQ_X-tsrV","colab_type":"text"},"source":["## 6. Train!"]},{"cell_type":"code","metadata":{"id":"Q7wT782Hrq-R","colab_type":"code","colab":{}},"source":["# uncomment to reset the model to its initial state:\n","# model.set_weights(W0)\n","history = model.fit(dataset, epochs=200, verbose=1)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"3Dan-chKsvbU","colab_type":"code","colab":{}},"source":["def draw_history(hist):\n","    fig, axs = plt.subplots(1, 2, figsize=(10, 5))\n","    axs[0].plot(hist.epoch, hist.history['loss'])\n","    if 'val_loss' in hist.history:\n","        axs[0].plot(hist.epoch, hist.history['val_loss'])\n","    axs[0].legend(('training loss', 'validation loss'))\n","    axs[1].plot(hist.epoch, hist.history['accuracy'])\n","    if 'val_accuracy' in hist.history:\n","        axs[1].plot(hist.epoch, hist.history['val_accuracy'])\n","\n","    axs[1].legend(('training accuracy', 'validation accuracy'))\n","    plt.show()"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"iAJu4rMys3da","colab_type":"code","colab":{}},"source":["draw_history(history)"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"7BldvImFYfm4","colab_type":"text"},"source":["## 7. Generating text with the RNN"]},{"cell_type":"markdown","metadata":{"id":"b3kuXO1jYu-a","colab_type":"text"},"source":["Take a word sequence and generate the following 128 words:"]},{"cell_type":"code","metadata":{"id":"uyTL_hrbxwj-","colab_type":"code","colab":{}},"source":["def gen_long(model, word_id_arr, n_words=128):\n","    out = []\n","    words = list(word_id_arr.copy())\n","    for _ in range(n_words):\n","        keys = np.reshape(np.array(words), [-1, n_input])\n","\n","        probs = model(keys).numpy()[0]  # (n_input, vocab_size) next-word probabilities\n","        pred_index = probs.argmax(axis=1)\n","        pred = pred_index[-1]           # most likely word after the last position\n","        out.append(pred)\n","\n","        words = words[1:]\n","        words.append(pred)\n","    sentence = to_text(out)\n","    return sentence"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"R7gzaQ1FyRCE","colab_type":"code","colab":{}},"source":["for input_example_batch, target_example_batch in dataset.take(10):\n","    input_seq = input_example_batch.numpy()[0]\n","    sentence = gen_long(model, input_seq)\n","    print(to_text(input_seq), '...')\n","    print('\\t...', sentence, '\\n')"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"wGv_d3bzZB9a","colab_type":"text"},"source":["Or try entering some text and see the continuation:"]},{"cell_type":"code","metadata":{"code_folding":[],"id":"_iX7hcrFtsrW","colab_type":"code","colab":{}},"source":["while True:\n","    prompt = \"%s words: \" % n_input\n","\n","    try:\n","        sentence = input(prompt)\n","    except KeyboardInterrupt:\n","        break\n","\n","    sentence = sentence.strip()\n","    words = sentence.split(' ')\n","    if len(words) != n_input:\n","        continue\n","    try:\n","        symbols_in_keys = [dictionary[w] for w in words]\n","    except KeyError:\n","        print(\"Word not in dictionary\")\n","        continue\n","\n","    sentence = gen_long(model, symbols_in_keys)\n","    print(sentence)\n"],"execution_count":0,"outputs":[]},
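{"cell_type":"markdown","metadata":{"id":"sampled-gen-md","colab_type":"text"},"source":["Greedy argmax generation is deterministic and can get stuck repeating the same phrase. A common alternative is to sample the next word from the predicted distribution; below is a minimal sketch, mirroring `gen_long`, with a temperature parameter that controls the randomness:"]},{"cell_type":"code","metadata":{"id":"sampled-gen-code","colab_type":"code","colab":{}},"source":["# sketch: sample the next word instead of taking the argmax\n","# temperature < 1 -> sharper distribution (more conservative),\n","# temperature > 1 -> flatter distribution (more random)\n","def gen_sampled(model, word_id_arr, n_words=128, temperature=0.8):\n","    out = []\n","    words = list(word_id_arr.copy())\n","    for _ in range(n_words):\n","        keys = np.reshape(np.array(words), [-1, n_input])\n","        probs = model(keys).numpy()[0][-1]           # distribution for the last position\n","        logits = np.log(probs + 1e-9) / temperature  # rescale in log space\n","        probs = np.exp(logits) / np.exp(logits).sum()\n","        pred = np.random.choice(len(probs), p=probs)\n","        out.append(pred)\n","        words = words[1:] + [pred]\n","    return to_text(out)\n","\n","for input_example_batch, _ in dataset.take(3):\n","    input_seq = input_example_batch.numpy()[0]\n","    print(to_text(input_seq), '->', gen_sampled(model, input_seq))"],"execution_count":0,"outputs":[]},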
{"cell_type":"markdown","metadata":{"id":"PCKy-0EotsrX","colab_type":"text"},"source":["## 8. Exercise\n"]},{"cell_type":"markdown","metadata":{"id":"OYXY5jfMAi6x","colab_type":"text"},"source":["* Run with 5-7 input words instead of 3.\n","* Increase the number of training epochs, since convergence will take much longer (and each epoch will take longer, too!)."]},{"cell_type":"markdown","metadata":{"id":"y7-p8ClctsrY","colab_type":"text"},"source":["## 9. Further reading"]},{"cell_type":"markdown","metadata":{"id":"mjhwEgdKtsrZ","colab_type":"text"},"source":["[Illustrated Guide to Recurrent Neural Networks](https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9)\n","\n","[Illustrated Guide to LSTM’s and GRU’s: A step by step explanation](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21)"]}]}